{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Note\n",
    "\n",
    "Please view the [README](https://github.com/eclipse/deeplearning4j/tree/master/dl4j-examples/tutorials/README.md) to learn about installing, setting up dependencies, and importing notebooks in Zeppelin"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Background\n",
    "\n",
    "In this tutorial we will use a LSTM neural network to predict instacart users' purchasing behavior given a history of their past orders. The data originially comes from a Kaggle challenge (kaggle.com/c/instacart-market-basket-analysis). We first removed users that only made 1 order using the instacart app and then took 5000 users out of the remaining to be part of the data for this tutorial. \n",
    "\n",
    "For each order, we have information on the product the user purchased. For example, there is information on the product name, what aisle it is found in, and the department it falls under. To construct features, we extracted indicators representing whether or not a user purchased a product in the given aisles for each order. In total there are 134 aisles. The targets were whether or not a user will buy a product in the breakfast department in the next order. We also used auxiliary targets to train this LSTM. The auxiliary targets were whether or not a user will buy a product in the dairy department in the next order.\n",
    "\n",
    "We suspect that a LSTM will be effective for this task, because of the temporal dependencies in the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Imports\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "autoscroll": "auto"
   },
   "outputs": [],
   "source": [
    "import org.deeplearning4j.nn.api.OptimizationAlgorithm;\n",
    "import org.deeplearning4j.nn.conf.NeuralNetConfiguration;\n",
    "import org.deeplearning4j.nn.conf.Updater;\n",
    "import org.deeplearning4j.nn.conf.layers.LSTM;\n",
    "import org.deeplearning4j.nn.weights.WeightInit;\n",
    "import org.nd4j.linalg.activations.Activation;\n",
    "import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;\n",
    "import org.nd4j.linalg.lossfunctions.LossFunctions.LossFunction;\n",
    "import org.deeplearning4j.nn.conf.GradientNormalization;\n",
    "import org.deeplearning4j.eval.ROC;\n",
    "import org.datavec.api.records.reader.impl.csv.CSVSequenceRecordReader;\n",
    "import org.datavec.api.records.reader.SequenceRecordReader;\n",
    "import org.datavec.api.split.NumberedFileInputSplit;\n",
    "import org.deeplearning4j.datasets.datavec.RecordReaderMultiDataSetIterator;\n",
    "import org.nd4j.linalg.dataset.api.iterator.MultiDataSetIterator;\n",
    "import org.deeplearning4j.nn.conf.ComputationGraphConfiguration;\n",
    "import org.deeplearning4j.nn.graph.ComputationGraph;\n",
    "import org.nd4j.linalg.dataset.api.MultiDataSet;\n",
    "import org.nd4j.linalg.api.ndarray.INDArray;\n",
    "import java.io.File;\n",
    "import java.net.URL;\n",
    "import java.io.BufferedInputStream;\n",
    "import java.io.FileInputStream;\n",
    "import java.io.BufferedOutputStream;\n",
    "import java.io.FileOutputStream;\n",
    "import org.apache.commons.io.FilenameUtils;\n",
    "import org.apache.commons.io.FileUtils;\n",
    "import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;\n",
    "import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;\n",
    "import org.apache.commons.compress.archivers.tar.TarArchiveEntry;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " \n",
    "\n",
    "### Download Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To download the data, we will create a temporary directory that will store the data files, extract the tar.gz file from the url, and place it in the specified directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "autoscroll": "auto"
   },
   "outputs": [],
   "source": [
    "val DATA_URL = \"https://bpstore1.blob.core.windows.net/tutorials/instacart.tar.gz\"\n",
    "val DATA_PATH = FilenameUtils.concat(System.getProperty(\"java.io.tmpdir\"), \"dl4j_instacart/\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "autoscroll": "auto"
   },
   "outputs": [],
   "source": [
    "val directory = new File(DATA_PATH)\n",
    "directory.mkdir() \n",
    "\n",
    "val archizePath = DATA_PATH + \"instacart.tar.gz\"\n",
    "val archiveFile = new File(archizePath)\n",
    "val extractedPath = DATA_PATH + \"instacart\" \n",
    "val extractedFile = new File(extractedPath)\n",
    "\n",
    "FileUtils.copyURLToFile(new URL(DATA_URL), archiveFile) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will then extract the data from the tar.gz file, recreate directories within the tar.gz file into our temporary directories, and copy the files from the tar.gz file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "autoscroll": "auto"
   },
   "outputs": [],
   "source": [
    "var fileCount = 0\n",
    "var dirCount = 0\n",
    "val BUFFER_SIZE = 4096\n",
    "val tais = new TarArchiveInputStream(new GzipCompressorInputStream( new BufferedInputStream( new FileInputStream(archizePath))))\n",
    "\n",
    "var entry = tais.getNextEntry().asInstanceOf[TarArchiveEntry]\n",
    "\n",
    "while(entry != null){\n",
    "    if (entry.isDirectory()) {\n",
    "        new File(DATA_PATH + entry.getName()).mkdirs()\n",
    "        dirCount = dirCount + 1\n",
    "        fileCount = 0\n",
    "    }\n",
    "    else {\n",
    "        \n",
    "        val data = new Array[scala.Byte](4 * BUFFER_SIZE)\n",
    "\n",
    "        val fos = new FileOutputStream(DATA_PATH + entry.getName());\n",
    "        val dest = new BufferedOutputStream(fos, BUFFER_SIZE);\n",
    "        var count = tais.read(data, 0, BUFFER_SIZE)\n",
    "        \n",
    "        while (count != -1) {\n",
    "            dest.write(data, 0, count)\n",
    "            count = tais.read(data, 0, BUFFER_SIZE)\n",
    "        }\n",
    "        \n",
    "        dest.close()\n",
    "        fileCount = fileCount + 1\n",
    "    }\n",
    "    if(fileCount % 1000 == 0){\n",
    "        print(\".\")\n",
    "    }\n",
    "    \n",
    "    entry = tais.getNextEntry().asInstanceOf[TarArchiveEntry]\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### DataSetIterators"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we will convert the raw data (csv files) into DataSetIterators, which will be fed into a neural network. Our training data will have 4000 examples which will be represented by a single DataSetIterator, and the testing data will have 1000 examples which will be represented by a separate DataSetIterator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "autoscroll": "auto"
   },
   "outputs": [],
   "source": [
    "val path = FilenameUtils.concat(DATA_PATH, \"instacart/\") // set parent directory\n",
    "\n",
    "val featureBaseDir = FilenameUtils.concat(path, \"features\") // set feature directory\n",
    "val targetsBaseDir = FilenameUtils.concat(path, \"breakfast\") // set label directory\n",
    "val auxilBaseDir = FilenameUtils.concat(path, \"dairy\") // set futures directory"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first initialize CSVSequenceRecordReaders, which will parse the raw data into record-like format. Because we will be using multitask learning, we will use two outputs. Thus we need three RecordReaders in total: one for the input, another for the first target, and the last for the second target. Next, we will need the RecordreaderMultiDataSetIterator, since we now have two outputs. We can add our SequenceRecordReaders using the addSequenceReader methods and specify the input and both outputs. The ALIGN_END alignment mode is used, since the sequences for each example vary in length.\n",
    "\n",
    "We will create DataSetIterators for both the training data and the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "autoscroll": "auto"
   },
   "outputs": [],
   "source": [
    "val trainFeatures = new CSVSequenceRecordReader(1, \",\");\n",
    "trainFeatures.initialize( new NumberedFileInputSplit(featureBaseDir + \"/%d.csv\", 1, 4000));\n",
    "\n",
    "val trainBreakfast = new CSVSequenceRecordReader(1, \",\");\n",
    "trainBreakfast.initialize( new NumberedFileInputSplit(targetsBaseDir + \"/%d.csv\", 1, 4000));\n",
    "\n",
    "val trainDairy = new CSVSequenceRecordReader(1, \",\");\n",
    "trainDairy.initialize(new NumberedFileInputSplit(auxilBaseDir + \"/%d.csv\", 1, 4000));\n",
    "\n",
    "val train =  new RecordReaderMultiDataSetIterator.Builder(20)\n",
    "    .addSequenceReader(\"rr1\", trainFeatures).addInput(\"rr1\")\n",
    "    .addSequenceReader(\"rr2\",trainBreakfast).addOutput(\"rr2\")\n",
    "    .addSequenceReader(\"rr3\",trainDairy).addOutput(\"rr3\")\n",
    "    .sequenceAlignmentMode(RecordReaderMultiDataSetIterator.AlignmentMode.ALIGN_END)\n",
    "    .build();"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "autoscroll": "auto"
   },
   "outputs": [],
   "source": [
    "val testFeatures = new CSVSequenceRecordReader(1, \",\");\n",
    "testFeatures.initialize( new NumberedFileInputSplit(featureBaseDir + \"/%d.csv\", 4001, 5000));\n",
    "\n",
    "val testBreakfast = new CSVSequenceRecordReader(1, \",\");\n",
    "testBreakfast.initialize( new NumberedFileInputSplit(targetsBaseDir + \"/%d.csv\", 4001, 5000));\n",
    "\n",
    "val testDairy = new CSVSequenceRecordReader(1, \",\");\n",
    "testDairy.initialize(new NumberedFileInputSplit(auxilBaseDir + \"/%d.csv\", 4001, 5000));\n",
    "\n",
    "val test =  new RecordReaderMultiDataSetIterator.Builder(20)\n",
    "    .addSequenceReader(\"rr1\", testFeatures).addInput(\"rr1\")\n",
    "    .addSequenceReader(\"rr2\",testBreakfast).addOutput(\"rr2\")\n",
    "    .addSequenceReader(\"rr3\",testDairy).addOutput(\"rr3\")\n",
    "    .sequenceAlignmentMode(RecordReaderMultiDataSetIterator.AlignmentMode.ALIGN_END)\n",
    "    .build();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " \n",
    "\n",
    "### Neural Network"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next task is to set up the neural network configuration. We see below that the ComputationGraph class is used to create a LSTM with two outputs. We can set the outputs using the setOutputs method of the NeuralNetConfiguraitonBuilder. One GravesLSTM layer and two RnnOutputLayers will be used. We will also set other hyperparameters of the model, such as dropout, weight initialization, updaters, and activation functions.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "autoscroll": "auto"
   },
   "outputs": [],
   "source": [
    "val conf = new NeuralNetConfiguration.Builder()\n",
    "    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)\n",
    "    .seed(12345)\n",
    "    .weightInit(WeightInit.XAVIER)\n",
    "    .dropOut(0.25)\n",
    "    .graphBuilder()\n",
    "    .addInputs(\"input\")\n",
    "    .addLayer(\"L1\", new LSTM.Builder()\n",
    "        .nIn(134).nOut(150)\n",
    "        .updater(Updater.ADAM)\n",
    "        .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)\n",
    "        .gradientNormalizationThreshold(10)\n",
    "        .activation(Activation.TANH)\n",
    "        .build(), \"input\")\n",
    "    .addLayer(\"out1\", new RnnOutputLayer.Builder(LossFunction.XENT)\n",
    "        .updater(Updater.ADAM)\n",
    "        .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)\n",
    "        .gradientNormalizationThreshold(10)\n",
    "        .activation(Activation.SIGMOID)\n",
    "        .nIn(150).nOut(1).build(), \"L1\")\n",
    "    .addLayer(\"out2\", new RnnOutputLayer.Builder(LossFunction.XENT)\n",
    "        .updater(Updater.ADAM)\n",
    "        .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)\n",
    "        .gradientNormalizationThreshold(10)\n",
    "        .activation(Activation.SIGMOID)\n",
    "        .nIn(150).nOut(1).build(), \"L1\")\n",
    "    .setOutputs(\"out1\",\"out2\")\n",
    "    .pretrain(false).backprop(true)\n",
    "    .build();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We must then initialize the neural network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "autoscroll": "auto"
   },
   "outputs": [],
   "source": [
    "val net = new ComputationGraph(conf);\n",
    "net.init();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model Training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To train the model, we use 5 epochs with a for loop and simply call the fit method of the ComputationGraph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "autoscroll": "auto"
   },
   "outputs": [],
   "source": [
    "for( epoch <- 1 to 5){\n",
    "    println(\"Epoch \"+ epoch);\n",
    "    net.fit( train );\n",
    "    train.reset();\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model Evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now evaluate our trained model on the original task, which was predicting whether or not a user will purchase a product in the breakfast department. Note that we will use the area under the curve (AUC) metric of the ROC curve.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "autoscroll": "auto"
   },
   "outputs": [],
   "source": [
    "// Evaluate model\n",
    "\n",
    "val roc = new ROC();\n",
    "\n",
    "test.reset();\n",
    "\n",
    "while(test.hasNext()){\n",
    "    val next = test.next();\n",
    "    val features =  next.getFeatures();\n",
    "    val output = net.output(features(0));\n",
    "    roc.evalTimeSeries(next.getLabels()(0), output(0));\n",
    "}\n",
    "\n",
    "println(roc.calculateAUC());"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We achieve a AUC of 0.75!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Spark 2.0.0 - Scala 2.11",
   "language": "scala",
   "name": "spark2-scala"
  },
  "language_info": {
   "codemirror_mode": "text/x-scala",
   "file_extension": ".scala",
   "mimetype": "text/x-scala",
   "name": "scala",
   "pygments_lexer": "scala",
   "version": "2.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
